R is an open-source language developed by Ross Ihaka and Robert Gentleman, based on S, an earlier programming language used for statistical computing. Today, R is one of the most popular tools for data analysis and statistics.
This class will cover a basic introduction to R. Our goal is to provide a resource for learning to use R for plant epidemiology research. In this introduction, we explain basic but important R tools and functions to give learners the background they need for the rest of the material. Please note that additional tools and functions will be introduced as the course progresses.
There are plenty of free, online resources available. Below are a few recommendations:
Remember, R is a programming language, and just like learning a foreign language, it requires time and commitment. Once you have learned some of the basics, you will be able to keep adding new “words” and capabilities to your vocabulary and continue to build mastery of R.
R is freely available for download from the R Foundation. R runs on Windows, Linux, and macOS.
Figure, R. A screenshot of the R GUI interface: (1) is the console, where code is entered and read by the software; (2) is the script or code source, where the user enters information that can be copied and pasted into the console; and (3) is a window with graph output.
Users typically interact with R through RStudio, an IDE (integrated development environment). RStudio is user-friendly software that makes it easier to work with R. It can be downloaded from the RStudio website. The software is highly customizable and includes many plugins (add-on pieces of code that provide additional functions).
Figure, IDE-RStudio. A screenshot of RStudio: (1) is a window with multiple pieces of information, for example, files, plots, packages, and help tools; (2) is the location of the source or script; (3) is the R console; (4) is a window with objects, coding history, and more.
R can be used as a simple calculator:
# Addition:
2+2
## [1] 4
# Subtraction:
10-5
## [1] 5
# Multiplication:
15*2
## [1] 30
# Division:
20/10
## [1] 2
# Exponents:
3^3 # or 3**3
## [1] 27
# Integer division (returns the whole-number quotient):
10%/%3
## [1] 3
# Returning the remainder from the above operation:
10%%3
## [1] 1
A function is a piece of code that performs a certain task on the data provided, and typically returns some result. There are many more mathematical calculations and manipulations you can accomplish with functions from base R:
# Mean:
mean(c(4,10))
## [1] 7
# Standard deviation:
sd(c(2,2,4,4,2,2))
## [1] 1.032796
# Square root:
sqrt(9)
## [1] 3
# Natural log:
log(100)
## [1] 4.60517
# Using log and defining the base (here base of 10):
log(100, base = 10)
## [1] 2
# Exponentials:
exp(5)
## [1] 148.4132
# Absolute value
abs(-7)
## [1] 7
All data in R is considered an ‘object’. An object can be a single value, such as a number, or a collection of values. You’ll learn more about different kinds, or ‘classes’, of objects in the next section.
Objects need to be assigned names to be used. To do this, use the <- or = signs. Typically, the arrow <- is preferred since it avoids errors in certain specific situations. There are keyboard shortcuts for <-: on Windows, press ‘Alt’ and ‘-’ simultaneously (Alt + -); on Mac, press ‘Option’ and ‘-’ (Option + -).
# You can use = to assign a value:
a = 2
a
## [1] 2
# You can also use <- to assign a value:
b <- 10
b
## [1] 10
# You can use the assigned objects to do calculations:
c <- a+b
c
## [1] 12
# An object can also be a character string (a series of letters or numbers), or even multiple character strings!
d <- "R is fun"
d
## [1] "R is fun"
e <- c("R is fun","So is plant pathology!")
e
## [1] "R is fun" "So is plant pathology!"
# Objects can contain multiple numbers:
f <- c(1,2,3,4,5) # the c means concatenate
f
## [1] 1 2 3 4 5
# Objects and numbers can be grouped into a table (more on this later):
g <- data.frame(First = c(1,2,3,4,5),
Second = c("A", "B", "C", "D", "E"))
g
## First Second
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
Although you can give almost any name for an object in R, there are some words that cannot be used and others that should be avoided.
# You cannot use ONLY a number as an object name:
1 <- 2
a1 <- 2
# Some words are reserved for specific functions and cannot be used:
NA <- 200
NaN <- 150
TRUE <-100
FALSE <- 50
Inf <- 0
# Other keywords that cannot be used as object names include:
# "break", "else", "for", "if", "next", "repeat", "return", and "while".
# You also cannot use specific keyboard symbols, such as /, @, %, !, etc.:
UCR/fito <- 100
UCR@fito <- 100
UCR%fito <- 100
## Error: <text>:21:4: unexpected input
## 20:
## 21: UCR%fito <- 100
## ^
It is good practice to label your objects in an intuitive and descriptive manner, and to follow consistent patterns. For example, if you have multiple objects such as monthly temperature reports, it would be preferable to use names like temp_jan, temp_feb, temp_mar, etc. These names are intuitive and descriptive, follow the same pattern (variable, underscore separator, month abbreviated to three characters), and will be quickly understandable to someone with some familiarity with the data.
On the other hand, names such as var1, col1, or Object1 make it very difficult for someone to understand the code without additional information, or for you to understand your own code after a break. Avoid mixing separator styles (e.g., temp_jan, temp.feb) and take extra care with capitalization (upper- and lower-case letters), since R is case-sensitive and inconsistent casing often leads to errors in your code or difficulties in running specific analyses. For example, temp_Jan and Temp_Feb both work as names, but they add an unnecessary layer of complexity.
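Because R is case-sensitive, a slip in capitalization silently refers to a different (usually nonexistent) object. A minimal sketch, using illustrative object names:

```r
# Case sensitivity in action: temp_jan and temp_Jan are different names.
temp_jan <- 10.5     # create an object with a lower-case name
exists("temp_jan")   # TRUE: this object exists
exists("temp_Jan")   # FALSE: the capitalized name was never created
```

Calling temp_Jan directly would produce an "object not found" error, which is one of the most common beginner mistakes.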
When possible, we also recommend avoiding giving names to objects that are also used as functions (more about functions below).
# The mean() function calculates the mean of a set of values or of a numeric object:
mean(c(1, 2, 3, 4, 5))
## [1] 3
# If you also use 'mean' as an object name, you and others may get confused, and depending on the circumstances, you may get a weird result:
mean <- c(1, 2, 3, 4, 5)
mean(mean)
## [1] 3
We can give names to plots (data visualization will be covered in the next section) and to many other data formats (data frames, matrices, vectors, lists, etc.). This allows us to reference these plots and other objects in future manipulations.
# We will learn how to create better plots later, but just for example:
# Load ggplot2, a package that is useful for making high-quality plots:
library(ggplot2)
# Assign some arbitrary data to object 'data_for_plot':
data_for_plot = data.frame(x=c(1,2,3,4,5,6), y=c(1,2,3,4,5,6))
# Create the plot object 'plot_name' with instructions on how to build a plot:
plot_name = ggplot(data_for_plot, aes(x = x, y=y)) + geom_point()
# Now that our plot has a name, we can easily view it:
plot_name
# By using our plot's name (plot_name), we can save our plot or add other features to it without re-creating it.
# Save our plot:
ggsave("our_plot.png", plot=plot_name, device="png")
## Saving 7 x 5 in image
# Add more aesthetic detail to our plot:
plot_name + geom_point(aes(colour=x))
From a practical point of view, the most important data types/classes are:
1.2, 3.141593,
10.0, and -5.21, 2, -4, an
500TRUE or FALSE, or T and
F for short"PA",
"Costa Rica", "soybean". But, these can also
refer to other numbers ("1.2", '10',
'-15'), or logical values ("TRUE",
"FALSE")Some values, such as numbers, can be coerced to a numeric data class
from a character data class and vice versa with functions like
as.numeric() and as.character().
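A quick illustration of coercion in both directions:

```r
as.numeric("1.2")    # character -> numeric
as.character(10)     # numeric -> character
as.numeric("hello")  # coercion fails: returns NA (with a warning)
```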
NOTE: For practical purposes, we simplified these types of
data as part of the introduction. In R, these are called atomic vectors.
There are also two other types of atomic vectors which are rarely used
and will not be mentioned here (complex and
raw). If you would like to explore this in more detail, we
recommend Chapters 3 and 13 of Hadley Wickham’s book, Advanced R.
# Let's create an example of a group of numeric values:
numeric_example <- c(1.2, 1.5, 3.14, 2.7182)
# You can directly ask if object 'numeric_example' is numeric with the is.numeric() function:
is.numeric(numeric_example)
## [1] TRUE
# Or determine the data class of the object with the class() function:
class(numeric_example)
## [1] "numeric"
# Let's start by creating a vector of numbers:
integer_example <- c(1,5,7,-4)
# See if the new vector is an integer-class data type:
is.integer(integer_example)
## [1] FALSE
# If it is not an integer, what is the class type?
class(integer_example)
## [1] "numeric"
# When you enter information as numbers, R will assume that it is numeric.
# We have to inform R that the values are integers by using the function as.integer()
integer_example <- as.integer(c(1,5,7,-4))
is.integer(integer_example)
## [1] TRUE
class(integer_example)
## [1] "integer"
# Now, if you enter information that has decimals and force the result to be an integer,
# R will ignore the decimal information
as.integer(c(3.1, 4.9, 5.499999, 9.99999))
## [1] 3 4 5 9
# If you enter information as words (class character) and try to force it to be an integer,
# R will return NA values along with a warning:
as.integer(c("Sunday", "Monday", "Tuesday", "Wednesday"))
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
# If you enter numerical values in character format, the coercion works because the characters represent numbers:
number_character <- c("1", "2", "3")
class(number_character)
## [1] "character"
as.integer(number_character)
## [1] 1 2 3
class(as.integer(number_character))
## [1] "integer"
# This type of transformation is very useful when loading data into R, where numeric values are sometimes read in as characters.
# As mentioned earlier, logical values can only be TRUE or FALSE:
logical_example <- c(TRUE, FALSE, T, F)
is.logical(logical_example)
## [1] TRUE
# Logical operations are very important for situations where you want to check specific conditions.
# For example, is one number greater than the other?
5 > 10 # is five greater than 10?
## [1] FALSE
# Also, if character strings are "equal" or identical:
"Sunday" == "Monday" # Sunday is equal to Monday?
## [1] FALSE
identical("Sunday","Monday")
## [1] FALSE
# You can also check that character strings are not 'equal' or identical:
"Sunday" != "Tuesday" # Sunday is different to Tuesday?
## [1] TRUE
# Here is an example where R coerces the number 1 to the character "1" and compares the two strings, allowing the logical statement to evaluate to TRUE:
1 < "2"
## [1] TRUE
# Character values are basic word elements (including numbers) or phrases:
character_example <- c("banana", 'epidemiology', "TRUE", "11", "A character can be more than a single word")
is.character(character_example)
## [1] TRUE
# Character values can be defined using either double quotes ("") or single quotes (''):
character_example
## [1] "banana"
## [2] "epidemiology"
## [3] "TRUE"
## [4] "11"
## [5] "A character can be more than a single word"
# But if you forget to include the " or ', R will give you an error message because R thinks that it is an object!
# Compare:
character_example <- c("banana", epidemiology)
## Error in eval(expr, envir, enclos): object 'epidemiology' not found
# vs.
character_example <- c("banana", "epidemiology")
# Factors, in contrast to character values, have a defined order and a set of levels.
factor_example <- as.factor(c("red", 'blue', "green", "red", "red"))
class(factor_example)
## [1] "factor"
# Our example has an order and has levels. In this case, the 'order' matches the sequence in which the values were introduced above. R automatically sets the 'levels' by alphabetical order.
factor_example
## [1] red blue green red red
## Levels: blue green red
# A useful function is factor(). With factor(), you can define the order, level, and other attributes. For example, you can change the levels from alphabetical order 'blue','green','red' to any other configuration:
factor_example2 <- factor(c("red", 'blue', "green", "red", "red"), levels = c("red", "blue", "green"))
factor_example2
## [1] red blue green red red
## Levels: red blue green
# You can also change the name of the elements by changing the level label
factor_example3 <- factor(c("red", 'blue', "green", "red", "red", "yellow"),
levels = c("red", "blue", "green", "yellow"),
labels = c("RED", "Blue", "green", "green"))
factor_example3
## [1] RED Blue green RED RED green
## Levels: RED Blue green
# By adding a new color, yellow, and giving it the label 'green', we redefined what
# 'yellow' means: it now displays as green. We also changed the labels for 'red' and
# 'blue' to include upper-case letters.
Tip: The order and levels of a factor-class object can dictate the order in which later functions are applied and how resulting plots are formatted.
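When the categories have a true ranking, R also offers ordered factors, created with the ordered = TRUE argument of factor(). A small sketch (the size labels here are just illustrative):

```r
# Ordered factors support comparisons between levels:
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)
sizes[1] < sizes[2]   # TRUE: "small" comes before "large" in the level order
```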
# A common format for date and time data is POSIX (Portable Operating System Interface).
# POSIX describes the date and time, down to the millisecond:
as.POSIXct("2021-04-25 11:30:45")
## [1] "2021-04-25 11:30:45 EDT"
# You can use different time zones and date formats by defining the format and tz
# parameters. In the example above, we wrote the date as year-month-day. We can easily
# convert other date formats and define the time zone with the as.POSIXct() function,
# as shown in the next example.
# The date in New Zealand may be written as day/month-year. We tell the as.POSIXct()
# function to expect this format by typing format = "%d/%m-%Y". We also tell the
# as.POSIXct() function that the time zone is New Zealand with tz = "NZ".
as.POSIXct("25/04-2021 14:30:45",
format = "%d/%m-%Y %H:%M:%OS",
tz = "NZ")
## [1] "2021-04-25 14:30:45 NZST"
# If the tz parameter is not defined, R assumes your system's time zone.
# Also, note that we used / (slash) instead of - (dash). This demonstrates that R can
# handle the many different formats that date information may come in.
# This means that we can use different input formats, and the as.POSIXct() function
# will still print the information using the default POSIX format.
# There are two functions which we can use: as.POSIXct() and as.POSIXlt(), which create
# different classes of dates and times.
ct <- as.POSIXct("2021-04-05 11:30:45")
lt <- as.POSIXlt("2021-04-05 11:30:45")
class(ct)
## [1] "POSIXct" "POSIXt"
class(lt)
## [1] "POSIXlt" "POSIXt"
# Although they are technically different classes of objects, both outputs look the same:
ct
## [1] "2021-04-05 11:30:45 EDT"
lt
## [1] "2021-04-05 11:30:45 EDT"
# The difference between these two classes is in their internal structure. Internally,
# ct stores the number of seconds since 1970-01-01 and is generally preferable for use
# in data sets. lt stores the full date and time as a list of components, and is
# generally more readable to humans.
unclass(ct) # The big number is the total of seconds since 1970-01-01
## [1] 1617636645
## attr(,"tzone")
## [1] ""
unclass(lt)
## $sec
## [1] 45
##
## $min
## [1] 30
##
## $hour
## [1] 11
##
## $mday
## [1] 5
##
## $mon
## [1] 3
##
## $year
## [1] 121
##
## $wday
## [1] 1
##
## $yday
## [1] 94
##
## $isdst
## [1] 1
##
## $zone
## [1] "EDT"
##
## $gmtoff
## [1] NA
##
## attr(,"tzone")
## [1] "" "EST" "EDT"
## attr(,"balanced")
## [1] TRUE
# Despite the difference in internal structure, it is possible to extract specific information, for example, year, month, day, etc. from both.
weekdays(ct)
## [1] "Monday"
months(lt)
## [1] "April"
quarters(ct)
## [1] "Q2"
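Beyond the convenience functions above, format() can pull out any component of a date-time using the same conversion codes used for parsing (a small sketch, using UTC to keep the output predictable):

```r
# format() extracts components of a date-time as character strings:
d <- as.POSIXct("2021-04-05 11:30:45", tz = "UTC")
format(d, "%Y")      # year: "2021"
format(d, "%H:%M")   # hour and minute: "11:30"
```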
# Another function and class we can use for date information is as.Date(), which
# produces a Date object.
dt <- as.Date("2021-04-25 11:30:45")
dt
## [1] "2021-04-25"
class(dt)
## [1] "Date"
# As you can see, Date objects are similar to the POSIX classes. However, they store
# only the date; if you provide a date and time, only the date is kept. This is in
# contrast to the as.POSIXct() and as.POSIXlt() functions, which also store the time
# and infer a time zone.
dt <- as.Date("2021-04-25")
dt
## [1] "2021-04-25"
ct <- as.POSIXct("2021-04-25")
ct
## [1] "2021-04-25 EDT"
Now that we understand the most important types of data, we can learn about how these types of data can be grouped for analyses. R is quite flexible in terms of grouping data. There are basically six different data structures in R:
- Scalar: a single value, e.g., x <- 2; y <- "beans"; b = x > y
- Vector: a sequence of values of the same data type
- Matrix: a two-dimensional arrangement of values of the same data type
- Array: like a matrix, but with more than two dimensions
- Data frame: a table whose columns can contain different data types
- List: a structure that can hold any of the other data structures
The figure below illustrates the differences between these data structure types:
Types of data structure in R. Image source
# You can use the function c(), which means concatenate, to create vectors
x <- c(1, 2, 3, 4)
y <- c("a", "b", "c", "d")
x
## [1] 1 2 3 4
y
## [1] "a" "b" "c" "d"
# You can select an element within a vector using square brackets [].
y[3] # This will extract and provide the third element in vector y.
## [1] "c"
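Square brackets also accept a vector of positions, and a negative index drops elements instead of selecting them. A quick sketch, re-creating y so the chunk stands alone:

```r
y <- c("a", "b", "c", "d")
y[c(1, 3)]   # extract multiple elements at once: "a" "c"
y[-2]        # a negative index drops that element: "a" "c" "d"
```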
# Here's an example of a matrix containing numbers 1 to 12. With 'ncol' and 'nrow' arguments, we construct a matrix with 3 columns and 4 rows:
matrix_a <- matrix(1:12,ncol=3,nrow=4)
matrix_a
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
# To extract elements from a matrix, you have to specify first the row and then the column:
matrix_a[2,3]
## [1] 10
# If we don't know how big a matrix is, we can see its dimensions using the dim() function:
dim(matrix_a)
## [1] 4 3
# Names can be added to rows and columns by using the option 'dimnames':
matrix(1:12,nrow=4,ncol=3 ,
dimnames = list(c("A", "B", "C", "D"),
c("X", "Y", "Z")))
## X Y Z
## A 1 5 9
## B 2 6 10
## C 3 7 11
## D 4 8 12
# Arrays can be constructed with the 'dim' argument specifying the number of rows, columns, and matrices.
array_a <- array(1:36,dim=c(3,4,3)) # 3 matrices with 3 rows and 4 columns each.
array_a
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 13 16 19 22
## [2,] 14 17 20 23
## [3,] 15 18 21 24
##
## , , 3
##
## [,1] [,2] [,3] [,4]
## [1,] 25 28 31 34
## [2,] 26 29 32 35
## [3,] 27 30 33 36
# Note that the array is presented as multiple matrices, allowing us to see all of its values.
# You can also add row, column, and matrix names with the 'dimnames' argument.
array_a = array(1:36,dim=c(4,3,3),
dimnames = list(c("A", "B", "C", "D"),
c("X", "Y", "Z"),
c("First", "Second", "Third")))
# Similar to vectors and matrices, you can extract elements by using [ ] (brackets)
array_a[2,1,2] # This is the value at row 2, column 1, of matrix layer 2.
## [1] 14
array_a[1,1,1] # This is the value at row 1, column 1, of matrix layer 1.
## [1] 1
# If you do not specify one or more of the indices, R will return the values along all of the unspecified dimensions:
array_a[,1,] # This extracts the first column from all matrices in the array.
## First Second Third
## A 1 13 25
## B 2 14 26
## C 3 15 27
## D 4 16 28
array_a[1,1,] # This extracts the value in row 1, column 1 of all 3 matrices in array_a.
## First Second Third
## 1 13 25
Data frames can have vectors of different data types. This means that data frames can consist of a column of character values, a column of numeric values, and a column of logical data values, all in one data structure.
# Here we construct a numeric vector, a character vector, and a logical vector.
vec_numer <- c(1,2,3,4,5)
vec_char <- c("A", "B", "C", "D", "E")
vec_logic <- c(T, F, T, F, T)
# We use the data.frame() function to group our vectors together in a data frame.
df <- data.frame(vec_numer, vec_char, vec_logic)
df
## vec_numer vec_char vec_logic
## 1 1 A TRUE
## 2 2 B FALSE
## 3 3 C TRUE
## 4 4 D FALSE
## 5 5 E TRUE
# The str function will show the class of the vectors composing the data frame.
str(df)
## 'data.frame': 5 obs. of 3 variables:
## $ vec_numer: num 1 2 3 4 5
## $ vec_char : chr "A" "B" "C" "D" ...
## $ vec_logic: logi TRUE FALSE TRUE FALSE TRUE
# To select a column in a data frame, you can use the $ symbol:
df$vec_char
## [1] "A" "B" "C" "D" "E"
# Just like in other data structures, we can use the square brackets [] to extract values in specific rows and columns.
df[2,2] # This is the value in row 2, column 2 of the data frame.
## [1] "B"
A list is the most complex type of data structure. Lists can hold all the other data structures in a single list object.
# Here we construct a scalar, numeric vector, matrix, array, and data frame containing both numeric and character values and store them all in a list:
a <- 2
b <- c(1,2,3,4,5)
c <- matrix(1:20,4,5)
d <- array(1:40, c(4,5,2))
e <- data.frame(numbers = c(1:5),
characters = LETTERS[1:5])
first_list <- list(a, b, c, d, e)
first_list
## [[1]]
## [1] 2
##
## [[2]]
## [1] 1 2 3 4 5
##
## [[3]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
##
## [[4]]
## , , 1
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
##
## , , 2
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 21 25 29 33 37
## [2,] 22 26 30 34 38
## [3,] 23 27 31 35 39
## [4,] 24 28 32 36 40
##
##
## [[5]]
## numbers characters
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
# We can inspect the composition of the list with str(), just like we can with data frames:
str(first_list)
## List of 5
## $ : num 2
## $ : num [1:5] 1 2 3 4 5
## $ : int [1:4, 1:5] 1 2 3 4 5 6 7 8 9 10 ...
## $ : int [1:4, 1:5, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
## $ :'data.frame': 5 obs. of 2 variables:
## ..$ numbers : int [1:5] 1 2 3 4 5
## ..$ characters: chr [1:5] "A" "B" "C" "D" ...
# You can add names to each data structure:
second_list <- list(scalar = a, vector = b, matrix = c, array = d, data_frame = e)
second_list
## $scalar
## [1] 2
##
## $vector
## [1] 1 2 3 4 5
##
## $matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
##
## $array
## , , 1
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
##
## , , 2
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 21 25 29 33 37
## [2,] 22 26 30 34 38
## [3,] 23 27 31 35 39
## [4,] 24 28 32 36 40
##
##
## $data_frame
## numbers characters
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
# A list can include another list:
third_list <- list(scalar = a, vector = b, data_frame = e, my_list = second_list)
third_list
## $scalar
## [1] 2
##
## $vector
## [1] 1 2 3 4 5
##
## $data_frame
## numbers characters
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
##
## $my_list
## $my_list$scalar
## [1] 2
##
## $my_list$vector
## [1] 1 2 3 4 5
##
## $my_list$matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
##
## $my_list$array
## , , 1
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
##
## , , 2
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 21 25 29 33 37
## [2,] 22 26 30 34 38
## [3,] 23 27 31 35 39
## [4,] 24 28 32 36 40
##
##
## $my_list$data_frame
## numbers characters
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
# Using [] (brackets), we can extract different components/elements from within the list
third_list[[3]] # Extract the 3rd component/element, data frame 'e'.
## numbers characters
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
third_list[[3]][2] # Extract the 2nd column (characters) of the 3rd element (data frame 'e').
## characters
## 1 A
## 2 B
## 3 C
## 4 D
## 5 E
third_list[[3]][1,2] # Extract the value in the 1st row and the 2nd column of the 3rd element.
## [1] "A"
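Elements of a named list (like second_list above) can also be pulled out by name, with $ or with [[ ]]. A small self-contained sketch:

```r
# Named list elements can be extracted by name:
lst <- list(numbers = c(1, 2, 3), letters = c("A", "B"))
lst$letters          # extract the 'letters' element with $
lst[["numbers"]][2]  # extract by name with [[ ]], then index within it
```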
Functions are one of the core components of the R language. They help make data analyses easier, especially when automating processes.
Let’s start with a simple example of defining our own function
FUN_weather():
FUN_weather <- function(x){
daily_weather = paste0("Today the weather is ", x)
print(daily_weather)
}
FUN_weather("good")
## [1] "Today the weather is good"
FUN_weather("rainy")
## [1] "Today the weather is rainy"
FUN_weather("hot!!!")
## [1] "Today the weather is hot!!!"
We can divide our functions into three components: name, arguments, and body.
- Name: in our example, FUN_weather(). As always, we recommend using names that are intuitive and follow some pattern. In FUN_weather, we used FUN to indicate that we have created a function, and weather indicates that the function has something to do with weather.
- Arguments: the inputs listed inside the parentheses of FUN_weather(). In our example, we have only one input, x, although it is very common to have multiple inputs.
- Body: the code between the braces of FUN_weather(). Often, you will not see the full body of a function unless you create it yourself or tweak how an already-existing function works. The body is where you tell your function what you want it to do. You can create objects within the body (like daily_weather) and perform operations on them, including printing, as we did above. R will process the entire body, from the first line to the last, with the last line dictating the function's output. In this example, the output is “Today the weather is good” when the input is “good”.
NOTE: There is one more component in the operation of a function: the environment. In the majority of cases (~99.9%) you will not need to work with it directly, so we will not cover it here. But if you want to learn more, we recommend Hadley Wickham’s book, Advanced R.
In the next example, the function was constructed to decide if it is a good day to go outside. For this function, we use multiple arguments to make the decision. There are a lot of new things in the code, but for now let’s focus on the function itself. To be able to go outside, three conditions must be met: (i) the temperature must be equal to or greater than 22 C (argument x); (ii) the day must be sunny (y); and (iii) you cannot be busy, meaning you have the time to go outside (z). The function ifelse(), inside of FUN_go_out(), will return “YES” if all conditions are met, and “NO” otherwise. This information is saved in an object called cond. Depending on the result, this information is pasted into the output message with the function paste0().
# Here we create the function 'FUN_go_out'. This function will tell us if it's a good idea to go outside based on temperature (x), rain (y), and whether you are busy or not (z).
FUN_go_out <- function(x = 15, y = "rain", z = "busy"){
cond <- ifelse(x >= 22 &
y == "sunny" &
z == "not_busy",
"YES", "NO")
output<- paste0("Is it a good day to go out? ", cond)
print(output)
}
# We can use the function we've created above by calling the function and giving it the inputs '15','rain', and 'busy':
FUN_go_out(15,"rain","busy")
## [1] "Is it a good day to go out? NO"
Another difference from the first function, ‘FUN_weather()’, is the use of pre-defined (default) values for the arguments (x = 15, y = “rain”, z = “busy”). This means that, unless otherwise specified, the arguments will always take the values x = 15, y = "rain", and z = "busy". Now, let’s see what happens if we run the function without changing any of the arguments (i.e., using the defaults).
# Here we run the function without changing any pre-defined values:
FUN_go_out()
## [1] "Is it a good day to go out? NO"
In this case, R used the default argument values from the function’s definition. Since none of the criteria were met, the function output shows that it is not a good day to go out.
What happens if we change the arguments x, y, and z?
FUN_go_out(x = 25, y = "sunny", z = "not_busy")
## [1] "Is it a good day to go out? YES"
We get a different answer: it is a good day to go outside.
We can also run the same example with less typing. When using functions, you do not always need to use a label for each argument (in this case, x=25, y=“sunny”, and z=“not_busy”). R will assume that the inputs are in the same order used when defining the function (temperature, rain/no rain, busy/not busy).
FUN_go_out(25, "sunny", "not_busy")
## [1] "Is it a good day to go out? YES"
When an argument/input is missing, R will supply the default argument in place of the missing value.
# In this example, we do not define z as "busy" or "not_busy". R will use the default value (z="busy") that is used in the function's definition.
FUN_go_out(x = 25, "sunny")
## [1] "Is it a good day to go out? NO"
# The above line of code will produce the same result as if we had written:
FUN_go_out(x=25,"sunny","busy")
## [1] "Is it a good day to go out? NO"
You can enter the arguments/inputs in a different order as long as they are properly labeled. However, this is not recommended because it increases the potential of making a mistake.
# In this example, R knows which order x, y, and z should go in and is able to complete the computation:
FUN_go_out(z = "not_busy", x = 25, y = "sunny")
## [1] "Is it a good day to go out? YES"
# In this example, the arguments are not labeled. As a result, R matches each value by position rather than by meaning.
FUN_go_out("not_busy",25,"sunny")
## [1] "Is it a good day to go out? NO"
The result is that R tells you it is NOT a good day to go outside, even though all of our conditions for going outside are met (temperature above 22, sunny, and not busy).
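To see why, recall that unlabeled values are matched by position, so the call assigns x = "not_busy", y = 25, and z = "sunny". The 'sunny' check then compares a number with a character string:

```r
# With positional matching, y receives 25, so the condition y == "sunny"
# compares a number with a character string:
25 == "sunny"   # FALSE, so ifelse() returns "NO"
```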
So far, the examples we have shown are very simple. As we move into more complex methods, like linear regression or machine learning algorithms, the complexity of our functions will increase. Many of the functions we will use have already been developed by others. You will still have opportunities to improve your own function-writing skills by understanding how these functions work and how they can be modified.
Now that we understand the different data types, data structures, and the basic form and use of functions, we can transition to learning about packages in the next section.
Formally, packages constitute the fundamental unit of what we define as “shareable code” (Wickham 2015 - R Packages). A package is basically the easiest way to share a collection of custom functions and other elements (such as data and documentation) among multiple users. By default, R comes with several packages installed, such as base and graphics, but there are thousands of other packages developed by people from different backgrounds, including plant pathologists, and many of these packages were developed to help solve a wide range of problems.
More simply put, packages add more functions and features to base R. An example of a popular package is ggplot2 (https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf), which includes many features and functions that improve your ability to create high-quality custom graphics.
The packages available during any session in R can be viewed in the ‘Packages’ tab of the ‘Files’/‘Plots’/‘Packages’ pane, as seen below.
In RStudio, you can find which packages are installed in the “Packages” tab. The package name, a brief description, and the version are shown.
The most common way to install a package is through CRAN (the Comprehensive R Archive Network, https://cran.r-project.org/). CRAN is a networked repository of packages.
Let’s illustrate installing and loading a package with the next example, where we install and load psych, a package with many statistical functions.
# We use the function 'install.packages' to securely download package 'psych' from CRAN. The code for this package will be stored on your computer.
install.packages("psych")
# Now that the package is stored on your computer, we use the 'library' function to make it available in our R session:
library(psych) # load the package
# We can also use the function 'require' to accomplish this same task:
require(psych)
Note: the difference between library() and
require() is that library() will stop
executing the code with an error if the package is missing, while
require() returns FALSE and lets the code continue.
Typically, it is best practice for beginners to use
library() over require(). You can learn more
about this from Statology
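The difference can be seen directly. In the small sketch below, the package name is deliberately made up so that loading fails:

```r
# library() signals an error for a missing package, which halts a script:
# library(notARealPackage123)   # Error: there is no package called 'notARealPackage123'

# require() instead returns FALSE (with a warning), so execution continues:
ok <- suppressWarnings(require("notARealPackage123", quietly = TRUE))
ok
## [1] FALSE
```

Because require() fails quietly, a missing dependency can go unnoticed until much later, which is why library() is usually the safer choice.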
Developers try to avoid using functions with similar names, but they do not always succeed; it is common for two different packages to have functions with the same name. There are several ways to avoid these conflicts, the most common being to specify which package to use with ‘packagename::function()’. This is illustrated in the next example, where the function ‘t()’ is defined both by base R and by a user-created function.
# Here we create an example matrix to perform functions 't' on:
matrix_1 <- matrix(1:9, 3,3)
matrix_1
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
# Now, let us transpose this matrix (switch columns to rows and vice versa), using the function t() from base R, which is already installed and loaded:
t(matrix_1)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
# So far, so good! But now if we create a function called 't', we may obtain an answer we did not expect. In our new function 't', we mean to add 10 to every input (x) given to the function.
t <- function(x){
x+10 }
# Now if we try to use base R's function 't' to transpose our example matrix, it doesn't work properly:
t(matrix_1)
## [,1] [,2] [,3]
## [1,] 11 14 17
## [2,] 12 15 18
## [3,] 13 16 19
# Instead, 10 is added to every value within our matrix.
# Therefore, we need to dictate to R which 't' function should be applied by specifying the package name in the following format (package::function):
# If we want to use base 't' to transpose the matrix:
base::t(matrix_1)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
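As a related tip, base R's conflicts() function lists the names defined in more than one attached environment, which is a quick way to spot this kind of masking. A small sketch continuing the 't' example:

```r
# Mask base::t with our own 't' in the global environment:
t <- function(x) { x + 10 }

# conflicts() reports object names found in two or more places on the
# search path, so our global 't' and base::t make "t" appear in the result:
"t" %in% conflicts()

# Remove our function so base::t behaves normally again:
rm(t)
```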
An essential tool to learn is the help()
function, or the question mark (?) with the function name. These
provide information related to the function, as well as information
about the package where it is located, and examples of its use.
# Here, we show how to use the 'help' and '?' functions to pull up information on a function 'cor.test':
help(cor.test)
## starting httpd help server ... done
?cor.test
Both commands will provide a description of the cor.test()
function in another R pane, as shown below:
In RStudio, the function description is shown in the Help tab when you use help() or ?
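When you do not know a function's exact name, R can also search the documentation itself; the pattern 'correlation' below is just an illustration:

```r
# Fuzzy search across all installed documentation (shorthand for help.search()):
# ??correlation

# apropos() returns the names of objects on the search path matching a pattern:
apropos("cor.test")
```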
When we wrote this document (2021-3-25), there were 17,371 packages available on CRAN. These packages were built by different people, with different backgrounds, focused on different types of data analyses. Naturally, there can be compatibility issues between packages. With this in mind, developers at RStudio started to build tidyverse, a group of packages that “share an underlying design philosophy, grammar, and data structures.”
There are several packages that are part of tidyverse, and every day more packages that share tidyverse’s principles are added to CRAN. tidyverse includes several ‘core’ packages that you can read more about here. The core packages included when tidyverse is installed and loaded are:
tidyverse and core packages figure. These core packages are installed and loaded with tidyverse automatically. Image source
# To install tidyverse:
install.packages("tidyverse")
# Load tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We will use many of the tools available with tidyverse as we go
through these educational materials. We will remind you about the
different packages as we work through the course.
So far, all of our example data was created in R. Realistically, we
work from databases developed as part of our research and import those
databases into R. These databases and other materials typically come in
these formats: .xlsx, .csv, .txt,
etc. We will also want to export any results or graphs that are produced
as a part of our analysis in R. Here we will show you how to set your
work directory, the location on your computer where you keep your data
and where you will want your results and graphs to be stored. It is
important to set your work directory because an incorrect work directory
can complicate importing your data and can result in missing results and
graphs after your analysis.
First, we will show you how to find the default work directory R has
automatically set. To do this, we use the function
getwd().
# See the current work directory:
getwd()
## [1] "G:/My Drive/Professional/Side_projects/web_epidem"
If you want to change the work directory, there are a couple ways to
do it. The first way to change the work directory is by using the
function setwd() and indicating the exact path that
you want to set as your working directory. Your data will be in a
sub-folder at this location, and you will save your results and graphs
to another sub-folder.
One example showing how you might use a university-related “D” drive:
# Use the function 'setwd()' with the path to your desired work directory within quotations ("").
# setwd("D:/OneDrive_PSU/The Pennsylvania State University/Epidem class - PSU")
# NOTE: Windows paths use backslashes (\). In R, the backslash is an escape character, so write paths with forward slashes (/) or doubled backslashes (\\); single backslashes in a path cause errors.
Although using the function setwd() will set your work
directory, it is not always practical. For example, if you work on two
computers (one at home and another at work or the laboratory), or if you
are collaborating with others, the work directory will need to be re-set
every time the script is accessed by a new person. That is probably not
very practical!
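If you do use setwd(), one defensive habit is to check that the folder exists first, so a wrong path is reported immediately instead of causing a confusing error later. In the sketch below, 'project_dir' is a placeholder path for illustration:

```r
# 'project_dir' is a placeholder path used only for illustration:
project_dir <- "D:/OneDrive_PSU/Epidem class - PSU"

if (dir.exists(project_dir)) {
  setwd(project_dir)          # only change directory if the folder is there
} else {
  message("Directory not found; staying in: ", getwd())
}
```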
The second way you can set a work directory properly (and avoid any
issues that arise with setwd()) depends upon using
good practices for organizing and saving your work. Creating R
projects is an effective way to maintain a consistent workflow.
Instructions for creating an R project:
Figure folders. An example of folder organization for creating an R project
Creating an R project places a .Rproj file in your work
directory folder, similar to the one in the example below. The next
time that you use R, you can click on this file to open RStudio with
your work directory and previous work loaded.
Now that we have defined a project, we need to import our data into
R. In the folder data we have a file called
data_demo.csv. We will read this file into R using the
function read_csv() from the package readr
(one of core packages of tidyverse). With this function, you will be
provided information about each vector (variable) type.
# Note that we don't have to type the full path to data_demo. This is because we have defined the work directory.
# That means that this:
# data_demo <- readr::read_csv("C:/Users/mnd20/web_epidem/data/intro_r/data_demo.csv")
# is equivalent to this:
data_demo <- readr::read_csv("data/intro_r/data_demo.csv")
## Rows: 64 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): trt, var
## dbl (7): plot, blk, sev, inc, yld, don, fdk
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_demo
## # A tibble: 64 × 9
## plot trt var blk sev inc yld don fdk
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 107 A R 1 1.95 35 86.6 0.11 4
## 2 109 A R 1 1.2 20 80.3 0.15 4
## 3 204 A R 2 0.9 20 81.1 0 42
## 4 214 A R 2 4.05 50 85.1 0.07 25
## 5 305 A R 3 1.1 15 84.8 0 29
## 6 316 A R 3 1.8 10 93.7 0 24
## 7 406 A R 4 1.4 25 84.5 0.06 37
## 8 412 A R 4 0.45 10 84.8 0.07 42
## 9 108 A S 1 13.3 55 92.3 0.42 NA
## 10 110 A S 1 12.3 50 103. 0.26 NA
## # ℹ 54 more rows
Note that the full path (“D:/OneDrive_PSU/…data.csv”) will be
unique to each computer, and will need to be modified each time the
script is accessed by another person or on another computer. That is why
defining the ‘relative path’, the file’s path relative to the defined
work directory, is typically the best practice.
An instance where providing the full path name can be helpful is when
loading data not located within the defined work directory. For example,
other_data <- readr::read_csv("D:/OneDrive_PSU/Another_Folder/other_data.csv")
can be used to import data that does not exist within the work
directory. Don’t forget that each new person or computer will need to
update this path!
Other functions to read .csv files include the base R
functions read.csv() and read.csv2() (from the
utils package). For .xlsx (Excel) files, we
can use the function read_excel() from the package
readxl, and for .txt (text) files, the function
read_table() from the package readr. While
these are the most common formats for data, there are a number of
packages that support other types of data.
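The base reader can be tried without any external file by round-tripping a small data frame through a temporary .csv; the sketch below is self-contained and uses only base R:

```r
# Write a tiny data frame to a temporary .csv file:
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(trt = c("A", "B"), sev = c(1.95, 0.45)),
          tmp, row.names = FALSE)

# Read it back with base R's read.csv() and inspect the structure:
reloaded <- read.csv(tmp)
str(reloaded)
```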
Now that we have our data loaded, we can use R to summarize information, add new variables, or reshape the data. In this section, we will focus on ideas related to data wrangling, or data manipulation.
It is often necessary to change the shape of your data (for example,
from long to wide), to filter the data, to make specific and documented
changes to the data, to summarize the data, and more. To do this, we
will use the tools available in the tidyverse packages. Specifically, we
will use tidyr
and dplyr, two
core tidyverse packages that offer expanded data manipulation and
cleaning capabilities.
Before we commence with data manipulation, it is important to have a look at the data.
# We can use the str() function (introduced in the data frame section of this page) to see information about the different variables in our demo data:
str(data_demo)
## spc_tbl_ [64 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ plot: num [1:64] 107 109 204 214 305 316 406 412 108 110 ...
## $ trt : chr [1:64] "A" "A" "A" "A" ...
## $ var : chr [1:64] "R" "R" "R" "R" ...
## $ blk : num [1:64] 1 1 2 2 3 3 4 4 1 1 ...
## $ sev : num [1:64] 1.95 1.2 0.9 4.05 1.1 1.8 1.4 0.45 13.3 12.3 ...
## $ inc : num [1:64] 35 20 20 50 15 10 25 10 55 50 ...
## $ yld : num [1:64] 86.6 80.3 81.1 85.1 84.8 ...
## $ don : num [1:64] 0.11 0.15 0 0.07 0 0 0.06 0.07 0.42 0.26 ...
## $ fdk : num [1:64] 4 4 42 25 29 24 37 42 NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. plot = col_double(),
## .. trt = col_character(),
## .. var = col_character(),
## .. blk = col_double(),
## .. sev = col_double(),
## .. inc = col_double(),
## .. yld = col_double(),
## .. don = col_double(),
## .. fdk = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
As you can see, the str() function produces a lot of
information. Important things to notice are that the dimensions of our
data are 64 rows by 9 columns. We also see that all variables are
numeric with the exception of variables trt and
var, which are character data.
Now, we can use the function summary() to look at a few
summary statistics of our data. We can also use the functions
head() and tail(), which are useful to show
information about the first and last rows in a data set,
respectively.
# Summary statistics:
summary(data_demo)
## plot trt var blk
## Min. :101.0 Length:64 Length:64 Min. :1.00
## 1st Qu.:179.8 Class :character Class :character 1st Qu.:1.75
## Median :258.5 Mode :character Mode :character Median :2.50
## Mean :258.5 Mean :2.50
## 3rd Qu.:337.2 3rd Qu.:3.25
## Max. :416.0 Max. :4.00
##
## sev inc yld don
## Min. : 0.000 Min. : 0.00 Min. : 80.30 Min. :0.00000
## 1st Qu.: 0.300 1st Qu.: 5.00 1st Qu.: 99.65 1st Qu.:0.00000
## Median : 0.775 Median :10.00 Median :106.35 Median :0.06000
## Mean : 1.648 Mean :15.86 Mean :105.56 Mean :0.07422
## 3rd Qu.: 1.837 3rd Qu.:20.00 3rd Qu.:114.38 3rd Qu.:0.10000
## Max. :13.300 Max. :60.00 Max. :123.20 Max. :0.42000
##
## fdk
## Min. : 0.00
## 1st Qu.: 4.00
## Median : 6.50
## Mean :10.61
## 3rd Qu.:15.50
## Max. :42.00
## NA's :2
# First lines:
head(data_demo)
## # A tibble: 6 × 9
## plot trt var blk sev inc yld don fdk
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 107 A R 1 1.95 35 86.6 0.11 4
## 2 109 A R 1 1.2 20 80.3 0.15 4
## 3 204 A R 2 0.9 20 81.1 0 42
## 4 214 A R 2 4.05 50 85.1 0.07 25
## 5 305 A R 3 1.1 15 84.8 0 29
## 6 316 A R 3 1.8 10 93.7 0 24
# Last lines:
tail(data_demo)
## # A tibble: 6 × 9
## plot trt var blk sev inc yld don fdk
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 207 D S 2 0 0 113. 0.21 4
## 2 211 D S 2 1.1 20 112. 0.07 3
## 3 309 D S 3 1.25 10 117 0.15 8
## 4 314 D S 3 0.75 5 117. 0.06 3
## 5 401 D S 4 0 0 118. 0 6
## 6 413 D S 4 0.3 5 116 0.09 10
These outputs give us a general idea of how the data appears. Note
that for the variable fdk (Fusarium damaged
kernels), there are two observations which are missing and coded as
NA.
tidyverse and many of its associated packages offer tools that enable us to manipulate and manage our data. Before we show you some examples of these functions, we will learn about using ‘pipes’ to combine multiple functions. Pipes are a common programming tool that allow R and other programs to perform multiple operations on objects in sequence.
The most common type of pipe is %>%, which is
included when you load tidyverse. Additionally, four different pipes
are available from the [magrittr package](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html),
and can be used to create more complicated and sophisticated piping
sequences.
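As an aside, base R (version 4.1 and later) also provides a native pipe, |>, which handles simple left-to-right chains without loading any package:

```r
# The native pipe passes the left-hand side as the first argument of the next call:
c(0.1, 2.5, 7.5) |> log() |> mean()

# This is equivalent to mean(log(c(0.1, 2.5, 7.5))). Unlike magrittr's %>%,
# the native pipe has no '.' placeholder for the piped value.
```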
Below, we show an example illustrating the use of the
%>% pipe in analyzing amounts of the mycotoxin
deoxynivalenol (DON) among samples:
# We create the vector 'DON', which contains DON measurements (in arbitrary units) across a set of samples:
DON <- c(0.1, 2.5, 7.5, 1, 0.9, 3.2, 4.5)
# Example 1:
# In a step-by-step fashion, we calculate the mean of the log-transformed DON values. For each step, we create a new object.
DON_log <- log(DON)
DON_log_mean <- mean(DON_log)
DON_log_mean
## [1] 0.4557823
# Example 2:
# We can also calculate the mean of the log-transformed DON values by 'nesting' functions within a single line:
DON_log_nest <- mean(log(DON))
DON_log_nest
## [1] 0.4557823
# Example 3:
# Finally, we can use a pipe to provide a logical flow to work with the data. The pipe links the steps of the calculation, doesn't result in the creation of multiple objects, and is more easily read and interpreted.
DON_log_pipe <- DON %>%
log(.) %>%
mean(.)
DON_log_pipe
## [1] 0.4557823
In the last few lines of code, the pipe takes the DON data,
transforms it, calculates the mean, and transfers the result to create
the object DON_log_pipe. This method doesn’t result in
populating the working environment with intermediary data objects and
multiple lines of disconnected code (like in Example 1) or in potential
errors and readability issues (like in Example 2). Using pipes
effectively can reduce time troubleshooting and make it easier for you
and others to understand what the code should be doing! In tidyverse,
and especially the package dplyr, there are several
excellent functions that help us to work with our data. The most
important functions can be divided into four operations:
filter, select, mutate, and
summarize.
The filter() function from dplyr is used to select rows or
observations based on defined criteria, such as quality scores,
treatment types, or minimum values. (Base R’s stats package also
provides a different filter() function, for time series, which dplyr masks.)
# The filter() function works by defining the data source and the specific condition.
filter(data_demo, is.na(fdk))
## # A tibble: 2 × 9
## plot trt var blk sev inc yld don fdk
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 108 A S 1 13.3 55 92.3 0.42 NA
## 2 110 A S 1 12.3 50 103. 0.26 NA
# In this case, we use 'is.na(fdk)' to filter only for rows where there are missing (NA) observations for the 'fdk' variable.
# Filtering can also be performed with pipes:
data_demo %>%
filter(is.na(fdk))
## # A tibble: 2 × 9
## plot trt var blk sev inc yld don fdk
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 108 A S 1 13.3 55 92.3 0.42 NA
## 2 110 A S 1 12.3 50 103. 0.26 NA
When writing code, it is important to learn some of the important
logical operators. Logical operators are symbols or words that connect
two expressions to produce either ‘TRUE’ or ‘FALSE’ values. In our
current example, we are interested in those operators which work closely
with filter().
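These operators are vectorized, so you can test them directly on a vector before using them inside filter(); a quick sketch:

```r
sev <- c(1.95, 0.45, 13.3, 0)

sev >= 5             # element-wise comparison
## [1] FALSE FALSE  TRUE FALSE

sev > 0 & sev < 2    # combine conditions with AND
## [1]  TRUE  TRUE FALSE FALSE

is.na(c(4, NA, 42))  # detect missing values
## [1] FALSE  TRUE FALSE
```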
<  less than
<=  less than or equal to
>  greater than
>=  greater than or equal to
==  equal to
!=  not equal to
!x  NOT x
x | y  x OR y
x & y  x AND y
isTRUE(x)  test if x is TRUE
is.na(x)  test if x is NA
Let’s practice filtering our demo data by different conditions with the use of logical operators:
# Here, we filter for data where column 'var' is exactly equal to 'R' (meaning resistant in this example data) by using the logical operator '=='. Then, we will print all observations where this condition is true.
data_demo %>%
filter(var == "R") %>%
print(n=Inf)
## # A tibble: 32 × 9
## plot trt var blk sev inc yld don fdk
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 107 A R 1 1.95 35 86.6 0.11 4
## 2 109 A R 1 1.2 20 80.3 0.15 4
## 3 204 A R 2 0.9 20 81.1 0 42
## 4 214 A R 2 4.05 50 85.1 0.07 25
## 5 305 A R 3 1.1 15 84.8 0 29
## 6 316 A R 3 1.8 10 93.7 0 24
## 7 406 A R 4 1.4 25 84.5 0.06 37
## 8 412 A R 4 0.45 10 84.8 0.07 42
## 9 112 B R 1 3.1 15 105. 0 6
## 10 114 B R 1 0.65 10 109. 0 17
## 11 209 B R 2 1.05 20 99 0.06 25
## 12 216 B R 2 2.1 25 104. 0.05 14
## 13 304 B R 3 0.8 10 107. 0 8
## 14 312 B R 3 0.45 15 104. 0 12
## 15 404 B R 4 2.5 30 101. 0 25
## 16 407 B R 4 2.95 25 105. 0 14
## 17 102 C R 1 0.65 10 105. 0 0
## 18 116 C R 1 0.3 5 107. 0.05 9
## 19 201 C R 2 0.3 10 108. 0.07 13
## 20 206 C R 2 0.3 5 98.3 0.05 16
## 21 302 C R 3 1.05 20 111. 0 4
## 22 307 C R 3 0.6 10 109. 0 7
## 23 410 C R 4 0.3 5 110. 0.06 5
## 24 415 C R 4 0.3 10 108. 0.05 18
## 25 103 D R 1 0.75 15 105. 0 4
## 26 106 D R 1 0.3 5 99.7 0.06 3
## 27 208 D R 2 1.35 40 94.6 0.06 17
## 28 212 D R 2 0.45 10 102. 0.08 11
## 29 310 D R 3 0.45 10 104. 0 18
## 30 313 D R 3 0 0 103 0.07 18
## 31 402 D R 4 0.15 5 106. 0 11
## 32 414 D R 4 0 0 107. 0.07 18
# Now we filter our data by two conditions: column 'var' equals 'R', and column 'trt' equals 'A'. To do this, we use logical operators '==' and connect the two conditions with the '&' symbol.
data_demo %>%
filter(var == "R" & trt == "A") %>%
print(n=Inf)
## # A tibble: 8 × 9
## plot trt var blk sev inc yld don fdk
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 107 A R 1 1.95 35 86.6 0.11 4
## 2 109 A R 1 1.2 20 80.3 0.15 4
## 3 204 A R 2 0.9 20 81.1 0 42
## 4 214 A R 2 4.05 50 85.1 0.07 25
## 5 305 A R 3 1.1 15 84.8 0 29
## 6 316 A R 3 1.8 10 93.7 0 24
## 7 406 A R 4 1.4 25 84.5 0.06 37
## 8 412 A R 4 0.45 10 84.8 0.07 42
# We can filter for observations where the column 'sev' (severity) contains a value equal to or greater than 5 with the logical operator '>='.
data_demo %>%
filter(sev >= 5) %>%
print(n=Inf)
## # A tibble: 5 × 9
## plot trt var blk sev inc yld don fdk
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 108 A S 1 13.3 55 92.3 0.42 NA
## 2 110 A S 1 12.3 50 103. 0.26 NA
## 3 213 A S 2 5.5 60 93.4 0.2 6
## 4 306 A S 3 5.05 45 98 0.17 5
## 5 411 A S 4 5.65 15 99.9 0.31 7
# We can also combine two conditions with OR. In this example, we use the logical operator '|' to keep observations where column 'fdk' is less than 1 OR column 'don' is equal to 0.
data_demo %>%
filter(fdk < 1 | don == 0) %>%
print(n=Inf)
## # A tibble: 21 × 9
## plot trt var blk sev inc yld don fdk
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 204 A R 2 0.9 20 81.1 0 42
## 2 305 A R 3 1.1 15 84.8 0 29
## 3 316 A R 3 1.8 10 93.7 0 24
## 4 112 B R 1 3.1 15 105. 0 6
## 5 114 B R 1 0.65 10 109. 0 17
## 6 304 B R 3 0.8 10 107. 0 8
## 7 312 B R 3 0.45 15 104. 0 12
## 8 404 B R 4 2.5 30 101. 0 25
## 9 407 B R 4 2.95 25 105. 0 14
## 10 111 B S 1 1.65 20 112. 0 2
## 11 102 C R 1 0.65 10 105. 0 0
## 12 302 C R 3 1.05 20 111. 0 4
## 13 307 C R 3 0.6 10 109. 0 7
## 14 101 C S 1 0.3 10 122 0.08 0
## 15 115 C S 1 0.3 5 115. 0 1
## 16 301 C S 3 0 0 118. 0 1
## 17 416 C S 4 0.45 15 123. 0 4
## 18 103 D R 1 0.75 15 105. 0 4
## 19 310 D R 3 0.45 10 104. 0 18
## 20 402 D R 4 0.15 5 106. 0 11
## 21 401 D S 4 0 0 118. 0 6
In some ways, the select() function works similarly to
filter(). The major difference is that
filter() functions work on observations, while
select() functions work on variables. Instead of
filtering through observations (rows) in our data, we can select and
un-select variables (columns).
# Here, we use select() on our example data to retain only desirable variables (columns):
data_demo %>%
select(trt, var, blk, sev, inc)
## # A tibble: 64 × 5
## trt var blk sev inc
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 A R 1 1.95 35
## 2 A R 1 1.2 20
## 3 A R 2 0.9 20
## 4 A R 2 4.05 50
## 5 A R 3 1.1 15
## 6 A R 3 1.8 10
## 7 A R 4 1.4 25
## 8 A R 4 0.45 10
## 9 A S 1 13.3 55
## 10 A S 1 12.3 50
## # ℹ 54 more rows
# We can achieve the same result by un-selecting undesirable variables as shown below:
data_demo %>%
select(-plot, -yld, -don, -fdk)
## # A tibble: 64 × 5
## trt var blk sev inc
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 A R 1 1.95 35
## 2 A R 1 1.2 20
## 3 A R 2 0.9 20
## 4 A R 2 4.05 50
## 5 A R 3 1.1 15
## 6 A R 3 1.8 10
## 7 A R 4 1.4 25
## 8 A R 4 0.45 10
## 9 A S 1 13.3 55
## 10 A S 1 12.3 50
## # ℹ 54 more rows
The mutate() function allows us to transform the
variables (columns) of our data. We can use mutate() to
change variable names and transform values, and to create new
variables.
# Here we use mutate() to create new variables 'sev_prop' (sev proportion) and 'inc_prop' (inc proportion). We transform values from original variables 'sev' and 'inc' to populate the cells in these new variable columns.
data_demo %>%
mutate(sev_prop = sev/100,
inc_prop = inc/100)
## # A tibble: 64 × 11
## plot trt var blk sev inc yld don fdk sev_prop inc_prop
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 107 A R 1 1.95 35 86.6 0.11 4 0.0195 0.35
## 2 109 A R 1 1.2 20 80.3 0.15 4 0.012 0.2
## 3 204 A R 2 0.9 20 81.1 0 42 0.009 0.2
## 4 214 A R 2 4.05 50 85.1 0.07 25 0.0405 0.5
## 5 305 A R 3 1.1 15 84.8 0 29 0.011 0.15
## 6 316 A R 3 1.8 10 93.7 0 24 0.018 0.1
## 7 406 A R 4 1.4 25 84.5 0.06 37 0.014 0.25
## 8 412 A R 4 0.45 10 84.8 0.07 42 0.0045 0.1
## 9 108 A S 1 13.3 55 92.3 0.42 NA 0.133 0.55
## 10 110 A S 1 12.3 50 103. 0.26 NA 0.123 0.5
## # ℹ 54 more rows
# In this example, we transform the 'yld' (yield) values from bushels/acre to kg/hectare. Then we create a new variable 'trt_var', that combines the character values in 'trt' and 'var' to create a concatenated character string, which can be used to identify different samples.
data_demo %>%
mutate(yld = yld*67.25,
trt_var = paste0(trt, "_", var))
## # A tibble: 64 × 10
## plot trt var blk sev inc yld don fdk trt_var
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 107 A R 1 1.95 35 5824. 0.11 4 A_R
## 2 109 A R 1 1.2 20 5400. 0.15 4 A_R
## 3 204 A R 2 0.9 20 5454. 0 42 A_R
## 4 214 A R 2 4.05 50 5723. 0.07 25 A_R
## 5 305 A R 3 1.1 15 5703. 0 29 A_R
## 6 316 A R 3 1.8 10 6301. 0 24 A_R
## 7 406 A R 4 1.4 25 5683. 0.06 37 A_R
## 8 412 A R 4 0.45 10 5703. 0.07 42 A_R
## 9 108 A S 1 13.3 55 6207. 0.42 NA A_S
## 10 110 A S 1 12.3 50 6920. 0.26 NA A_S
## # ℹ 54 more rows
We can use summarize() to quickly generate a summary of
our data. We can use a collection of other functions, such as
mean() or max(), within summarize() to collect
many different kinds of summaries. Note that dplyr accepts both
spellings, so summarize() and summarise() are
interchangeable.
# First, we group our example data by 'trt' to create meaningful summaries. Then we use statistical functions (mean(), max(), etc.,) within the summarize() function to collect statistics of the selected variables by trt group.
data_demo %>%
group_by(trt) %>%
summarise(sev = mean(sev),
inv = max(inc),
yld = sd(yld))
## # A tibble: 4 × 4
## trt sev inv yld
## <chr> <dbl> <dbl> <dbl>
## 1 A 3.94 60 7.79
## 2 B 1.64 35 7.19
## 3 C 0.572 20 6.67
## 4 D 0.438 40 7.13
# In this example, we show that you can group by multiple variables ('var' and 'trt') and then apply summarize to create a meaningful summary.
data_demo %>%
group_by(var, trt) %>%
summarize(sev = mean(sev),
inv = max(inc),
yld = sd(yld))
## `summarise()` has grouped output by 'var'. You can override using the `.groups`
## argument.
## # A tibble: 8 × 5
## # Groups: var [2]
## var trt sev inv yld
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 R A 1.61 50 4.07
## 2 R B 1.7 30 3.04
## 3 R C 0.475 20 3.94
## 4 R D 0.431 40 3.98
## 5 S A 6.28 60 3.79
## 6 S B 1.58 35 4.69
## 7 S C 0.669 15 5.72
## 8 S D 0.444 20 2.82
Functions pivot_longer() and pivot_wider()
from the tidyverse collection allow us to reshape our data into long or
wide formats, depending on our needs. We can also remove variables that
we don’t wish to reshape at this time.
# Here we reshape our example data to a longer format using pivot_longer(). First we remove unnecessary variable 'plot' and then we collect all variables EXCEPT 'trt', 'var', and 'blk' to be stored in a single column, allowing our data to take on a longer format. We can change column headings at this time to more accurately describe their contents.
data_longer <- data_demo %>%
select(-plot) %>% # Remove 'plot' variable
pivot_longer(cols = -c(trt, var, blk), # Define variables NOT to be reshaped
names_to = "variables", # Name new column containing variable names
values_to = "values") # Name new column containing variable values
data_longer
## # A tibble: 320 × 5
## trt var blk variables values
## <chr> <chr> <dbl> <chr> <dbl>
## 1 A R 1 sev 1.95
## 2 A R 1 inc 35
## 3 A R 1 yld 86.6
## 4 A R 1 don 0.11
## 5 A R 1 fdk 4
## 6 A R 1 sev 1.2
## 7 A R 1 inc 20
## 8 A R 1 yld 80.3
## 9 A R 1 don 0.15
## 10 A R 1 fdk 4
## # ℹ 310 more rows
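The reverse operation is pivot_wider(), which spreads a name/value pair back into separate columns. The sketch below uses a small stand-in tibble so it runs on its own (it needs only tidyr, a core tidyverse package):

```r
library(tidyr)

# A small long-format stand-in, shaped like data_longer:
long <- tibble::tibble(
  trt       = c("A", "A", "B", "B"),
  variables = c("sev", "inc", "sev", "inc"),
  values    = c(1.95, 35, 0.45, 10)
)

# Spread the 'variables'/'values' pair into one column per variable:
wide <- pivot_wider(long, names_from = variables, values_from = values)
wide
```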
After working with your data, you may want to save any tables you
have created. There are several functions supporting the export of
common file types like .csv and .xlsx. As a
counterpart to the function we used to import our example data
(data_demo <- readr::read_csv("data_demo.csv")), the function
write_csv() from the readr package allows you
to save your data frame to your work directory as a .csv
file.
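A minimal write_csv() sketch is shown below; it writes to a temporary file so it runs anywhere, but in practice you would use a path inside your project, such as a 'results' sub-folder:

```r
library(readr)

results <- data.frame(trt = c("A", "B"), mean_sev = c(3.94, 1.64))

# write_csv() writes a comma-separated file without row names:
out_file <- tempfile(fileext = ".csv")  # in practice: "results/summary.csv"
write_csv(results, out_file)

# Round-trip check:
read_csv(out_file, show_col_types = FALSE)
```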
We can use the function write.xlsx() from the package
openxlsx to save any tables to our work directory as Excel
files. Note that openxlsx is not a core tidyverse package;
it will need to be installed and loaded separately with the functions
install.packages() and library().
We can take this opportunity to save our resulting tables to a sub-folder within our work directory by including the name of the sub-folder within the file path.
# Install and load openxlsx package, if necessary:
install.packages("openxlsx", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/miran/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## package 'openxlsx' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\miran\AppData\Local\Temp\RtmpGK02Jh\downloaded_packages
library("openxlsx")
# Define the data object you want to save and provide the path where you want the file to go, including any sub-folders. Don't forget to include the desired name of the file ('example_output.xlsx') in the path!
write.xlsx(data_longer, # Data frame that we want to export
file = "./data/intro_r/example_output.xlsx") # Location and name of our .xlsx file
# You can further modify how you save your data with arguments such as sheetName (to name the worksheet) and overwrite (to control whether an existing file is replaced).
write.xlsx(data_longer, # Data frame that we want to export
file = "./data/intro_r/example_output.xlsx",
sheetName = "First_example", # Sheet name within the Excel file
overwrite = TRUE) # Replace the file if it already exists
# If you'd like to save multiple tables together, you can join them in a named list and save that object with write.xlsx(); each list element becomes its own worksheet.
two_data_frame <- list('ex_1' = data_demo, 'ex_2' = data_longer)
write.xlsx(two_data_frame, file = "./data/intro_r/two_sheets.xlsx")